(54) Voice activated control unit

(57) A hand-held wireless voice-activated device
(10) for controlling a host system (11), such as a
computer connected to the World Wide Web. The device
(10) has a display (10a), a microphone (10b), and a
wireless transmitter (10g) and receiver (10h). It may also
have a processor (10e) and memory (10f) for performing
voice recognition. A device (20) can be specifically
designed for Web browsing, by having a processor (20e)
and memory (20f) that perform both voice recognition
and interpretation of results of the voice recognition.
Description

TECHNICAL FIELD OF THE INVENTION 

The present invention relates generally to voice rec- 
ognition devices, and more particularly to a wireless 
voice-controlled device that permits a user to browse a 
hypermedia network, such as the World Wide Web, with 
voice commands. 

BACKGROUND OF THE INVENTION 

The Internet is a world-wide computer network, or 
more accurately, a world-wide network of networks. It 
provides an exchange of information and offers a vast 
range of services. Today, the Internet has grown so as 
to include all kinds of institutions, businesses, and even 
individuals at their homes. 

The World-Wide Web ("WWW" or "Web") is one of
the services available on the Internet. It is based on a
technology known as "hypertext", in which a document
has links to its other parts or to other documents.
Hypertext has been extended so as to encompass links to
any kind of information that can be stored on a computer,
including images and sound. For example, using the
Web, from within a document one can select highlighted
words or phrases to get definitions, sources, or related
documents, stored anywhere in the world. For this
reason, the Web may be described as a "hypermedia"
network.

The basic unit in the Web is a "page", a (usually) 
text-plus-graphics document with links to other pages. 
"Navigating" the Web primarily means moving around 
from page to page. 

The idea behind the Web is to collect all kinds of 
data from all kinds of sources, avoiding the problems of 
incompatibilities by allowing a smart server and a smart 
client program to deal with the format of the data. This 
capability to negotiate formats enables the Web to ac- 
cept all kinds of data, including multimedia formats, 
once the proper translation code is added to the servers 
and clients. The Web client is used to connect to and to 
use Web resources located on Web servers. 

One type of client software used to access and use
the Web is referred to as "Web browser" software. This
software can be installed on the user's computer to
provide a graphic interface, where links are highlighted or
otherwise marked for easy selection with a mouse or
other pointing device.

SUMMARY OF THE INVENTION 

One aspect of the invention is a wireless
voice-activated control unit for controlling a processor-based
host system, such as a computer connected to the
World Wide Web. A compact hand-held unit has a
microphone, a wireless audio input transmitter, a wireless
data receiver, and a display. The microphone receives
voice input from a user, thereby providing an audio input
signal. The audio transmitter wirelessly transmits data
derived from the audio signal to the host system. After
the host acts on the audio input, it delivers a
response in the form of image data, transmitted wirelessly
to the receiver. The display generates and displays images
represented by the image data.

Variations of the device can include a speaker for
audio output information. The device can also have a
processor and memory for performing front-end voice
recognition processes or even all of the voice
recognition.

An advantage of the invention is that it makes
information on the Web more accessible and useful. Speech
control brings added flexibility and power to the Web
interface and makes access to information more natural.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention will now be further described 
by way of example, with reference to the accompanying 
drawings in which: 

FIGURE 1 illustrates one embodiment of a wireless 
voice-activated control unit in accordance with the in- 
vention. 

FIGURE 2 illustrates another embodiment of a wire- 
less voice-activated control unit, specially configured for 
translating and interpreting audio input from the user. 

FIGURE 3 illustrates an example of a display pro- 
vided by the speakable command process. 

FIGURE 4 illustrates a portion of a Web page and 
its speakable links. 

FIGURE 5 illustrates a process of dynamically cre- 
ating grammars for use by the voice recognizer of FIG- 
URES 1 and 2. 

DETAILED DESCRIPTION OF THE INVENTION 

The invention described herein is directed to a wire- 
less voice-activated device for controlling a processor- 
based host system. That is, the device is a voice-acti- 
vated remote control device. In the example of this de- 
scription, the host system is a computer connected to 
the World-Wide Web and the device is used for voice- 
controlled web browsing. However, the same concepts 
can be applied to a voice-controlled device for control- 
ling any processor-based system that provides display 
or audio information, for example, a television. 

Various embodiments of the device differ with
regard to the "intelligence" embedded in the device. For
purposes of the invention, the programming used to rec- 
ognize an audio input and to interpret the audio input so 
that it can be used by conventional Web browser soft- 
ware is modularized in a manner that permits the extent 
of embedded programming to become a matter of de- 
sign and cost. 

FIGURE 1 illustrates one embodiment of a wireless
voice-activated control unit 10 in accordance with the
invention. It communicates with a host system 11. As
stated above, for purposes of this description, host
system 11 is a computer and is in data communication with
the World-Wide Web.

Control unit 10 has a display 10a and a microphone 
10b. Display 10a is designed for compactness and
portability, and could be an LCD. Microphone 10b receives
voice input from a user. It may have a "mute" switch 10c,
so that control unit 10 can be on, displaying images and 
even receiving non-audio input via an alternative input 
device such as a keypad (not shown), but not performing 
voice recognition. Microphone 10b may be a micro- 
phone array, to enhance the ability to differentiate the 
user's voice from other sounds. 

In the embodiment of FIGURE 1, control unit 10
performs all or part of the voice recognition process and
delivers speech data to host computer 11 via transmitter
10g. Host computer 11 performs various voice control
interpretation processes and also executes a Web
browser. However, in its simplest form, control unit 10 would
transmit audio data directly from microphone 10b to host
system 11, which would perform all processing.

In the case where control unit 10 performs all or part 
of the voice recognition process, control unit 10 has a 
processor 10e. Memory 10f stores voice recognition 
programming to be executed by processor 10e. An
example of a suitable processor 10e for speech
recognition is a signal processor, such as those manufactured
by Texas Instruments Incorporated. Where microphone
10b is a microphone array, processor 10e may perform
calculations for targeting the user's voice.

If control unit 10 performs only some voice processing,
it may perform one or more of the "front end" processes, 
such as linear predictive coding (LPC) analysis or 
speech end pointing. 
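
As an illustration of one such front-end process, the following Python sketch performs a simple energy-based form of speech end pointing. The frame length, the energy threshold, and the name find_endpoints are assumptions made for this example only; the description does not specify how end pointing is carried out.

```python
def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the speech region, or None.

    Assumed approach: a frame counts as speech when its short-time
    energy (mean of squared samples) exceeds a fixed threshold.
    """
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)

    # Indices of frames whose energy is above the threshold.
    active = [k for k, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    # Speech runs from the first active frame to the end of the last one.
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

Under this scheme, only the samples between the two returned indices would need further analysis or transmission to the host.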

If control unit 10 performs all voice recognition
processes, memory 10f stores these processes (as a voice
recognizer) as well as grammar files. In operation, the
voice recognizer receives audio input from microphone
10b, and accesses the appropriate grammar file. A
grammar file handler converts the grammar to
speech-ready form, creates a punctuation grammar, and
loads the grammar into the voice recognizer. The voice
recognizer uses the grammar file to convert the audio
input to a text translation.

The grammar files in memory 10f may be pre-de- 
fined and stored or may be dynamically created or may 
be a combination of both types of grammar files. An ex- 
ample of dynamic grammar file creation is described be- 
low in connection with FIGURE 5. The grammars may 
be written with the Backus-Naur form of context-free 
grammars and can be customized. In the embodiment 
of FIGURE 1, and where unit 10 is used for Web
browsing, host computer 11 delivers the HTML (hypertext
markup language) for a currently displayed Web page
to unit 10. Memory 10f stores a grammar file generator
for dynamically generating the grammar. In alternative
Web browsing embodiments, host 11 could dynamically
generate the grammar and download the grammar file
to control unit 10.

The output of the voice recognizer is speech data. 
The speech data is transmitted to host system 11, which 
performs voice control interpretation processes. Various
voice control interpretation processes for
voice-controlled Web browsing are described in U.S. Patent
Application Serial No. 08/419,229, entitled "Voice Activated
Hypermedia Systems Using Grammatical Metadata",
assigned to Texas Instruments Incorporated and
incorporated herein by reference. As a result of the
interpretation, the host system 11 may respond to the voice input
to control unit 10 by executing a command or providing 
a hypermedia (Web) link. 

An example of voice control interpretation other 
than for Web browsing is for commands to a television, 
where host system 11 is a processor-based television 
system. For example, the vocal command, "What's on 
TV tonight?", would result in a display of the television 
schedule. Another example of voice control interpreta- 
tion other than for Web browsing is for commands for 
computer-based household control. The vocal com- 
mand, "Show me the sprinkler schedule" would result in 
an appropriate display. 

After host system 11 has taken the appropriate ac- 
tion, a wireless receiver 10h receives data from host 
system 11 for display on display 10a or for output by 
speaker 10d. Thus, the data received from host system 
11 may be graphical (including text, graphics, images, 
and video) or audio. 

FIGURE 2 illustrates an alternative embodiment of 
the invention, a wireless voice-activated control unit 20 
that performs voice control interpretation as well as 
voice recognition. The voice control interpretation is 
specific to browsing a hypermedia resource, such as the 
Web. The host system 21 is connected to the hyperme- 
dia resource. 

Control unit 20 has components similar to those of 
control unit 10. However, its processor 20e executes
additional programming stored in memory 20f. Specifically,
the voice control interpretation processes may comprise 
a speakable command process, a speakable hotlist 
process, or a speakable links process. These processes 
and their associated grammar files reside on control unit 
20. 

The speakable command process displays a com- 
mand interface on display 20a and accepts various Web 
browsing commands. The process has an associated 
grammar file for the words and phrases that may be spo- 
ken by the user. 

FIGURE 3 illustrates an example of a display 30 
provided by the voice control interpretation process. 
One speakable command is a "Help" command,
activated with a button 31. In response, the command process
displays a "help page" that describes how to use voice- 
controlled browsing. 

Another speakable command is, "Show me my
speakable command list". Speaking this command
displays a page listing a set of grammars, each
representing a speakable command. Examples are
pagedown_command, back_command, and help_command. When
the command process receives a translation of one of 
these commands, it performs the appropriate action. 

FIGURE 3 also illustrates a feature of the voice rec- 
ognizer that is especially useful for Web browsing. The 
user has spoken the words, "What is the value of XYZ 
stock?" Once the voice recognizer recognizes an utter- 
ance, it determines the score and various statistics for 
time and memory use. As explained below, the request 
for a stock value can be a hotlist item, permitting the 
user to simply voice the request without identifying the 
Web site where the information is located. 

Another speakable command is "Show me my
speakable hotlist", activated by button 33. A "hotlist" is
a stored list of selected Uniform Resource Locators
(URLs), such as those that are frequently used. Hotlists
are also known as bookmarks. URLs are a well known
feature of the Web, and provide a short and consistent
way to name any resource on the Internet. A typical URL
has the following form:

http://www.ncsa.uiic.edu/General/NCSAHome.html

The various parts of the URL identify the transmission 
protocol, the computer address, and a directory path at 
that address. URLs are also known as "links" and "an- 
chors". 
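
The three parts named above can be shown by splitting the example URL with Python's standard urllib.parse module. This is a sketch for illustration; the function name url_parts and the dictionary keys are assumptions chosen to mirror the wording of the description.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the three parts named in the description."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,  # transmission protocol
        "address": parts.netloc,   # computer address
        "path": parts.path,        # directory path at that address
    }

print(url_parts("http://www.ncsa.uiic.edu/General/NCSAHome.html"))
```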

The speakable hotlist process permits the user to 
construct a grammar for each hotlist item and to asso- 
ciate the grammar with a URL. To create the grammar, 
the user can edit an ASCII grammar file and type in the 
grammar using the BNF syntax. For example, a gram- 
mar for retrieving weather information might define 
phrases such as, "How does the weather look today?" 
and "Give me the weather". The user then associates 
the appropriate URL with the grammar. 
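
A hotlist grammar of this kind might be sketched as follows. The rule notation is a simplified, BNF-like form assumed for this example (alternatives separated by "|", optional words in square brackets); the actual grammar file syntax is not given in this description. The rule name weather and the URL http://example.com/weather are hypothetical.

```python
import re

def expand(alternative):
    """Expand one alternative, with [optional] words, into plain phrases."""
    tokens = re.findall(r"\[[^\]]+\]|\S+", alternative)
    phrases = [[]]
    for token in tokens:
        if token.startswith("["):
            optional = token[1:-1].split()
            # Keep each phrase both without and with the optional words.
            phrases = phrases + [p + optional for p in phrases]
        else:
            phrases = [p + [token] for p in phrases]
    return [" ".join(p) for p in phrases]

def parse_rule(line):
    """Parse 'name ::= alt1 | alt2' into (name, list of phrases)."""
    name, rhs = line.split("::=", 1)
    phrases = []
    for alternative in rhs.split("|"):
        phrases.extend(expand(alternative.strip()))
    return name.strip(), phrases

RULE = "weather ::= how does the weather look [today] | give me the weather"
name, phrases = parse_rule(RULE)

# Associate the grammar with a (hypothetical) URL.
hotlist = {name: {"phrases": phrases, "url": "http://example.com/weather"}}
```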

The hotlist grammar file can be modified by voice. 
For example, a current page can be added as a hotlist 
item. Speaking the phrase, "Add this page to my hotlist" 
adds the title of the page to the grammar and associates 
that grammar with the current URL. Speaking the 
phrase, "Edit my speakable hotlist", permits the user to 
edit the grammar by adding additional phrases that will 
cause the page to be retrieved by voice. 

The speakable hotlist process is activated when the 
voice recognizer recognizes a hotlist translation from 
the hotlist grammar file and passes the translation to the 
hotlist process. The hotlist process looks up the associ- 
ated URL. It passes the URL to the browser residing on 
host computer 11 (via wireless communication), so that 
the web page may be retrieved and transmitted to the 
voice control unit 10 for display on display 10a. 
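
The lookup step described above can be sketched in Python as shown below. The class name Hotlist, the normalize helper, and the send_to_browser callback (standing in for the wireless link to the host's browser) are assumptions made for this example, and the URL is hypothetical.

```python
import re

def normalize(text):
    """Lower-case a phrase and drop punctuation and extra spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

class Hotlist:
    """Maps recognized phrases (translations) to URLs."""

    def __init__(self):
        self._urls = {}

    def add(self, phrase, url):
        self._urls[normalize(phrase)] = url

    def lookup(self, translation):
        return self._urls.get(normalize(translation))

def handle_translation(hotlist, translation, send_to_browser):
    """Look up the URL for a translation and pass it to the browser."""
    url = hotlist.lookup(translation)
    if url is None:
        return False  # not a hotlist item
    send_to_browser(url)
    return True
```

Adding the current page by voice would, under the same assumptions, amount to calling add with the page title and the current URL.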

The grammar files for speakable commands and 
the speakable hotlist are active at all times. This permits 
the user to speak the commands or hotlist links in any 
context. A speakable links process may also reside in 
memory 20f of voice control unit 20. Selected
information in a Web page may provide links, for access to other
Web pages. Links are indicated as such by being
underlined, highlighted, differently colored, outlined as in the
case of pictures, or otherwise identified. Instead of using
a mouse or other pointing device to select a link, the
user of voice control unit 10 may speak a link from a
page being displayed on display 10a.

FIGURE 4 illustrates a portion of a Web page 40
and its links. For example, the second headline 41 is a
link.

The grammar for speakable links includes the full
phrase as well as variations. In addition to speaking the
full phrase, the speaker may say "Diana in N period Y
period" (a literal variation), "Diana in NY", or "Diana in
New York".
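
One way such variations might be generated is sketched below in Python. The sketch assumes the link text is "Diana in N.Y." (FIGURE 4 is not reproduced here) and uses a small, hypothetical EXPANSIONS table for abbreviations; the function names are likewise illustrative.

```python
import re
from itertools import product

# Hypothetical table of abbreviation expansions.
EXPANSIONS = {"N.Y.": ["New York"]}

def token_variants(token):
    """Return the spoken forms allowed for one token of a link."""
    if "." not in token:
        return [token]
    pieces = re.findall(r"[A-Za-z0-9]+|\.", token)
    # Literal variation: each period is spoken as the word "period".
    literal = " ".join("period" if p == "." else p for p in pieces)
    # Compact variation: punctuation deleted.
    compact = token.replace(".", "")
    return [literal, compact] + EXPANSIONS.get(token, [])

def link_variants(link_text):
    """Combine the per-token variants into full speakable phrases."""
    choices = [token_variants(t) for t in link_text.split()]
    return [" ".join(combo) for combo in product(*choices)]

print(link_variants("Diana in N.Y."))
```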

Making a link speakable first requires obtaining the 
link/URL pair from its Web page. Because a Web page 
in HTML (hypertext markup language) format can have 
any length, the number of candidate link/URL pairs that 
the recognizer searches may be limited to those that are 
visible on a current screen of display 20a. A command 
such as, "Scroll down", updates the candidate link/URL 
pairs. Once the link/URL pairs for a screen are obtained, 
a grammar is created for all the links on the current
screen. Next, tokens in the links are identified and
grammars for the tokens are created. These grammars are
added to the recognizer's grammar files. Correct
tokenization is challenging because link formats can vary
widely. Links can include numbers, acronyms, invented 
words, and novel uses of punctuation. 
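
The first step, obtaining link/URL pairs from a page, can be sketched with Python's standard html.parser module. The class and function names are assumptions for this example. The visible_pairs helper is only a stand-in: which links are actually visible depends on the display layout, which this sketch does not model.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects (link text, URL) pairs from anchor tags with an href."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.pairs.append((text, self._href))
            self._href = None

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs

def visible_pairs(pairs, first, count):
    """Stand-in for 'links on the current screen': a window of pairs."""
    return pairs[first:first + count]
```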

Other challenges for speakable links are the length 
of links, ambiguity of links in the same page, and graph- 
ics containing bit-mapped links. For long links, the 
speakable links process permits the user to stop speak- 
ing the words in a link any time after N words. For am- 
biguity, the process may either default to the first URL 
or it may offer a choice of URLs to the user. For bit- 
mapped links, the process uses an <ALT> tag to look 
for link information. 
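
The handling of long and ambiguous links can be sketched as a prefix match over the candidate link/URL pairs. In this Python sketch, the function names, the default of two words for N, and the sample pairs used in the usage note are assumptions for illustration.

```python
def match_link(spoken, pairs, min_words=2):
    """Return the URLs of all links that begin with the spoken words.

    A link matches once at least min_words of it have been spoken
    (or the whole link, if the link is shorter than min_words).
    """
    words = spoken.lower().split()
    hits = []
    for text, url in pairs:
        link_words = text.lower().split()
        required = min(min_words, len(link_words))
        if len(words) >= required and link_words[:len(words)] == words:
            hits.append(url)
    return hits

def resolve(hits):
    """Default to the first URL when several links match."""
    return hits[0] if hits else None
```

With pairs for "Diana in New York today" and "Diana in London", speaking "Diana in" matches both links, so a caller could either use resolve to take the first URL or present both URLs to the user.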

The grammars for speakable links may be dynam- 
ically created so that only the grammar for a current dis- 
play is active and is updated when a new current display 
is generated. Dynamic grammar creation also reduces 
the amount of required memory 10f.

FIGURE 5 illustrates a suitable process of dynam- 
ically creating grammar files. This is the process imple- 
mented by the dynamic grammar generator of FIG- 
URES 1 and 2. As explained above, dynamic grammar 
files are created from current Web pages so that speak- 
able links may be recognized. U.S. Patent Application 
Serial No. 08/419,226, incorporated by reference 
above, further describes this method as applied to a 
voice-controlled host system 11, that is, voice control 
without a separate remote control device 10. 

A display, such as the display 40 of FIGURE 4, af- 
fects grammar constraints 52. The grammar constraints 
52 are input into a vocabulary 54 and the user agent 64. 
In turn, the vocabulary 54 feeds the online dictionary 56,
which inputs into the pronunciations module 58. The
pronunciations module 58, as well as the Speaker
Independent Continuous Speech Phonetic Models module
60, input into the User Agent 64. In addition, the Speech
module 66 inputs the user's speech into the User Agent
64. In parallel, the Context module 68 gets inputs from 
the screen 40 and inputs into the User Agent 64. 

An existing RGDAG (Regular Grammar Directed 
Acyclic Graph) may dynamically accommodate new 
syntax and vocabulary. Every time the screen 40 chang- 
es, the user agent 64 creates a grammar containing the 
currently visible underlined phrases (links). From this 
grammar, the user agent 64 tokenizes the phrases to 
create phrase grammars that can include, for example, 
optional letter spelling and deleted/optional punctuation. 
From the tokens, the user agent 64 creates phonetic 
pronunciation grammars using a combination of online 
dictionaries and a text-to-phoneme mapping. The voice 
recognition process then adds the grammars created. 
This involves several simple bookkeeping operations for
the voice recognizer, including identifying which sym- 
bols denote "words" to output. Finally, global changes 
are implemented to incorporate the new/changed gram- 
mars. For this, the grammars are connected in an RGD- 
AG relationship. In addition, the maximum depth for 
each symbol is computed. It is also determined whether 
the voice recognizer requires parse information by look- 
ing for ancestor symbols with output. Then the structure 
of the grammar for efficient parsing is identified. 
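
The depth computation mentioned above can be sketched in Python for a grammar held as a directed acyclic graph. The description does not define "maximum depth"; this sketch assumes it means the length of the longest chain of references from a symbol down to a symbol with no children. The dictionary representation and the symbol names in the usage note are also assumptions.

```python
def max_depths(grammar):
    """Compute the maximum depth of every symbol in an acyclic grammar.

    grammar maps each symbol to the list of symbols it references.
    Symbols that reference nothing have depth 0.
    """
    depths = {}

    def depth(symbol):
        if symbol not in depths:
            children = grammar.get(symbol, [])
            # One level more than the deepest child; 0 if no children.
            depths[symbol] = 1 + max((depth(c) for c in children), default=-1)
        return depths[symbol]

    for symbol in grammar:
        depth(symbol)
    return depths
```

For example, if a symbol links references link_1, link_1 references ny, and ny references the letters n and y, then links has depth 3. Memoizing each result in depths means a symbol shared by several grammars is visited only once.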

Although the invention has been described with ref- 
erence to specific embodiments, this description is not 
meant to be construed in a limiting sense. Various mod- 
ifications of the disclosed embodiments, as well as al- 
ternative embodiments, will be apparent to persons 
skilled in the art. 



Claims 

1. A voice-activated control unit for controlling a 
processor-based host system, comprising: 

a microphone for receiving voice input from a 
user, thereby providing an audio input signal; 
an audio transmitter for transmitting data de- 
rived from said audio input signal to said host 
system; 

a data receiver for receiving image data from 
said host system; and 

a display for generating display images repre- 
sented by said image data. 

2. The control unit of Claim 1, wherein said
microphone is switchable to an "ON" or "OFF" state
separately from said display.

3. The control unit of Claim 1 or Claim 2, wherein
said microphone is a multi-element microphone
array.

4. The control unit of any of Claims 1 to 3, further
comprising a processor for performing a voice
recognition process and a memory for storing said
voice recognition process and grammar files.

5. The control unit of Claim 4, wherein said voice
recognition process comprises linear predictive
coding analysis, and said transmitter is operable to
transmit the results of said analysis.

6. The control unit of Claim 4, wherein said voice
recognition process comprises speech end pointing
analysis, and said transmitter is operable to
transmit the results of said analysis.

7. The control unit of Claim 4, wherein said grammar
files are dynamically created, and said processor is
further operable to perform a dynamic grammar
generation process.

8. A voice-activated control unit for voice-control of
a host system in data communication with a
hypermedia resource, comprising:

a microphone for receiving voice input from a
user, thereby generating an audio input signal;
an audio transmitter for transmitting data
derived from said audio input signal to said host
system;

a data receiver for receiving image data from
said host system; and

a display for generating display images
represented by said image data and retrieved from
said hypermedia resource by said host system.

9. The control unit of Claim 8, further comprising a
processor for performing a voice recognition
process and a memory for storing said voice recognition
process and grammar files.

11. The control unit of Claim 9, wherein said voice
recognition process comprises linear predictive
coding analysis, and said transmitter is operable to
transmit the results of said analysis.

12. The control unit of Claim 9, wherein said voice
recognition process comprises speech end pointing
analysis, and said transmitter is operable to
transmit the results of said analysis.

13. The control unit of Claim 9, wherein said
grammar files are dynamically created, said processor
being further operable to perform a dynamic
grammar generation process.

14. The control unit of any of Claims 8 to 13, further
comprising a processor for performing a voice control
process and a memory for storing said voice control
process.

15. The control unit of Claim 14, wherein said voice
control process comprises a speakable commands
process such that said user may vocally direct the
operations of said host system.

16. The control unit of Claim 14 or Claim 15, wherein
said voice control processes comprise a speakable
hotlist process such that said user may vocally
request a particular one of said resources to be
retrieved by said host system.

17. The control unit of any of Claims 14 to 16,
wherein said voice control processes comprise a
speakable links process such that said user may
vocally request that a link on a current page being
displayed on said display be retrieved by said host
system.

18. The control unit of any of Claims 8 to 17, further
comprising a processor for performing voice
recognition processes and for performing dynamic
grammar creation processes, and a memory for storing
said processes.

19. A method of voice-activated control of a
processor-based host system comprising:

receiving a voice input from a user and
generating an audio input signal therefrom;
transmitting data derived from said audio input
signal to said host system;
receiving image data from said host system;
and

generating display images represented by said
image data.






EP 0 854 417 A2 



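FIG. 1 (block diagram): voice-activated control unit 10 with display 10a, microphone 10b, mute switch 10c, speaker 10d, processor 10e, memory 10f (voice recognizer, grammar files, dynamic grammar generator), voice input transmitter 10g, and data receiver 10h; host computer 11 with processor and memory (reference numerals 11a, 11b), the memory holding a voice control interpreter, grammar files, and a Web browser; host computer 11 connected to the WWW.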



FIG. 5 (block diagram): grammar constraints 52, vocabulary 54, online dictionary 56, pronunciations 58, speaker independent continuous speech phonetic models 60, speech, and context.



FIG. 2 (block diagram): voice-activated control unit 20 with display 20a, processor 20e, memory 20f (voice recognizer, voice control interpreter, grammar files, dynamic grammar generator), signal interface 20g, and further elements 20b and 20d; host computer 21 with processor and memory (reference numerals 21a, 21b), the memory holding a Web browser; host computer 21 connected to the WWW.



